Abstract

Spectral Clustering as a relaxation of the normalized/ratio cut has become one of
the standard graph-based clustering methods. Existing methods for the computation
of multiple clusters, corresponding to a balanced k-cut of the graph, are
either based on greedy techniques or heuristics which have weak connection to
the original motivation of minimizing the normalized cut. In this paper we propose
a new tight continuous relaxation for any balanced k-cut problem and show
that a related recently proposed relaxation is in most cases loose, leading to poor
performance in practice. For the optimization of our tight continuous relaxation
we propose a new algorithm for the difficult sum-of-ratios minimization problem
which achieves monotonic descent. Extensive comparisons show that our method
outperforms all existing approaches for ratio cut and other balanced k-cut criteria.


1 Introduction 

Graph-based techniques for clustering have become very popular in machine learning as they
allow for an easy integration of pairwise relationships in data. The problem of finding k clusters in
a graph can be formulated as a balanced k-cut problem [?], where ratio and normalized
cut are famous instances of balanced graph cut criteria employed for clustering, community
detection and image segmentation. The balanced k-cut problem is known to be NP-hard [?] and thus in
practice relaxations [?] or greedy approaches [6] are used for finding the optimal multi-cut. The
most famous approach is spectral clustering [7], which corresponds to the spectral relaxation of the
ratio/normalized cut and uses k-means in the embedding of the vertices found by the first k
eigenvectors of the graph Laplacian in order to obtain the clustering. However, the spectral relaxation has
been shown to be loose for k = 2 [8] and for k > 2 no guarantees are known of the quality of the
obtained k-cut with respect to the optimal one. Moreover, in practice even greedy approaches [6]
frequently outperform spectral clustering.

This paper is motivated by another line of recent work [?] where it has been shown that
an exact continuous relaxation for the two cluster case (k = 2) is possible for a quite general class of
balancing functions. Moreover, efficient algorithms for its optimization have been proposed which
produce much better cuts than the standard spectral relaxation. However, the multi-cut problem
still has to be solved via the greedy recursive splitting technique.

Inspired by the recent approach in [13], in this paper we tackle directly the general balanced k-cut
problem based on a new tight continuous relaxation. We show that the relaxation for the asymmetric
ratio Cheeger cut proposed recently by [13] is loose when the data does not contain k well-separated
clusters and thus leads to poor performance in practice. Similar to [13] we can also integrate label
information leading to a transductive clustering formulation. Moreover, we propose an efficient
algorithm for the minimization of our continuous relaxation for which we can prove monotonic
descent. This is in contrast to the algorithm proposed in [13] for which no such guarantee holds.
In extensive experiments we show that our method outperforms all existing methods in terms of the
achieved balanced k-cuts. Moreover, our clustering error is competitive with respect to several other
clustering techniques based on balanced k-cuts and recently proposed approaches based on
non-negative matrix factorization. We also observe that already a small amount of label information
improves the clustering error significantly.


2 Balanced Graph Cuts 

Graphs are used in machine learning typically as similarity graphs, that is, the weight of an edge
between two instances encodes their similarity. Given such a similarity graph of the instances, the
problem of clustering into k sets can be transformed into a graph partitioning problem, where the goal
is to construct a partition of the graph into k sets such that the cut, that is the sum of weights of the
edges from each set to all other sets, is small and all sets in the partition are roughly of equal size.

Before we introduce balanced graph cuts, we briefly fix the setting and notation. Let $G(V, W)$
denote an undirected, weighted graph with vertex set $V$ with $n = |V|$ vertices and weight matrix
$W \in \mathbb{R}_+^{n \times n}$ with $W = W^T$. There is an edge between two vertices $i, j \in V$ if $w_{ij} > 0$. The
cut between two sets $A, B \subset V$ is defined as $\mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$ and we write $\mathbf{1}_A$ for the
indicator vector of set $A \subset V$. A collection of $k$ sets $(C_1, \dots, C_k)$ is a partition of $V$ if $\cup_{i=1}^k C_i = V$,
$C_i \cap C_j = \emptyset$ if $i \neq j$ and $|C_i| \geq 1$, $i = 1, \dots, k$. We denote the set of all $k$-partitions of $V$ by $P_k$.
Furthermore, we denote by $\Delta_k$ the simplex $\{x : x \in \mathbb{R}^k,\ x \geq 0,\ \sum_{i=1}^k x_i = 1\}$.

Finally, a set function $\hat S : 2^V \to \mathbb{R}$ is called submodular if for all $A, B \subset V$,
$\hat S(A \cup B) + \hat S(A \cap B) \leq \hat S(A) + \hat S(B)$. Furthermore, we need the concept of the Lovász extension of a set function.

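Definition 1 Let $\hat S : 2^V \to \mathbb{R}$ be a set function with $\hat S(\emptyset) = 0$. Let $f \in \mathbb{R}^V$ be ordered in increasing
order $f_1 \leq f_2 \leq \dots \leq f_n$ and define $C_i = \{j \in V \mid f_j > f_i\}$ where $C_0 = V$. Then $S : \mathbb{R}^V \to \mathbb{R}$
given by $S(f) = \sum_{i=1}^n f_i \big(\hat S(C_{i-1}) - \hat S(C_i)\big)$ is called the Lovász extension of $\hat S$. Note that
$S(\mathbf{1}_A) = \hat S(A)$ for all $A \subset V$.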
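
Definition 1 can be evaluated directly by sorting the entries and peeling off one vertex at a time. The following is a minimal sketch in plain Python, not taken from the paper; the function names `lovasz_extension`, `make_cut` and `total_variation` and the 4-vertex weight matrix are assumptions made only for illustration. It checks numerically that the extension of the cut function agrees with $\frac{1}{2}\sum_{i,j} w_{ij}|f_i - f_j|$ and that indicator vectors reproduce the set function.

```python
def lovasz_extension(set_fn, f):
    """Lovász extension of set_fn (frozenset -> float, set_fn(empty) = 0) at f."""
    order = sorted(range(len(f)), key=lambda j: f[j])  # increasing order of f
    remaining = set(range(len(f)))                     # C_0 = V
    value = 0.0
    for j in order:
        before = set_fn(frozenset(remaining))          # set function on C_{i-1}
        remaining.discard(j)                           # C_i: drop the i-th smallest entry
        after = set_fn(frozenset(remaining))           # set function on C_i
        value += f[j] * (before - after)
    return value


def make_cut(W):
    """Return the set function C -> cut(C, complement of C) for symmetric W."""
    n = len(W)

    def cut(C):
        return sum(W[i][j] for i in C for j in range(n) if j not in C)

    return cut


def total_variation(W, f):
    """0.5 * sum_ij w_ij |f_i - f_j| for a symmetric weight matrix W."""
    n = len(W)
    return 0.5 * sum(W[i][j] * abs(f[i] - f[j]) for i in range(n) for j in range(n))


# Hypothetical toy graph (illustration only).
W = [[0, 1, 2, 0],
     [1, 0, 0, 3],
     [2, 0, 0, 1],
     [0, 3, 1, 0]]
cut = make_cut(W)
f = [0.2, 0.9, 0.5, 0.1]
indicator = [1.0, 0.0, 1.0, 0.0]              # indicator vector of A = {0, 2}
ext_f = lovasz_extension(cut, f)              # agrees with total_variation(W, f)
ext_A = lovasz_extension(cut, indicator)      # agrees with cut({0, 2})
```

Ties in `f` are removed one vertex at a time here, which gives the same value as the level-set definition because tied entries contribute a telescoping sum.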

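The Lovász extension of a set function is convex if and only if the set function is submodular [?].
The cut function $\mathrm{cut}(C, \overline{C})$, where $\overline{C} = V \setminus C$, is submodular and its Lovász extension is given by
$\mathrm{TV}(f) = \frac{1}{2} \sum_{i,j=1}^n w_{ij} |f_i - f_j|$.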


2.1 Balanced k-cuts


The balanced k-cut problem is defined as

\[
\min_{(C_1,\dots,C_k) \in P_k} \ \sum_{i=1}^{k} \frac{\mathrm{cut}(C_i, \overline{C_i})}{\hat S(C_i)} \ =: \ \mathrm{BCut}(C_1,\dots,C_k) \tag{1}
\]


where $\hat S : 2^V \to \mathbb{R}_+$ is a balancing function with the goal that all sets $C_i$ are of the same “size”.
In this paper, we assume that $\hat S(\emptyset) = 0$ and for any $C \subsetneq V$, $C \neq \emptyset$, $\hat S(C) \geq m$, for some $m > 0$.
In the literature one finds mainly the following submodular balancing functions (in brackets is the
name of the overall balanced graph cut criterion $\mathrm{BCut}(C_1,\dots,C_k)$),

\[
\begin{aligned}
\hat S(C) &= |C|, && \text{(Ratio Cut)}, \\
\hat S(C) &= \min\{|C|, |\overline{C}|\}, && \text{(Ratio Cheeger Cut)}, \\
\hat S(C) &= \min\{(k-1)|C|, |\overline{C}|\} && \text{(Asymmetric Ratio Cheeger Cut)}.
\end{aligned} \tag{2}
\]

The Ratio Cut is well studied in the literature, e.g. [?], and corresponds to a balancing function
without bias towards a particular size of the sets, whereas the Asymmetric Ratio Cheeger Cut recently
proposed in [13] has a bias towards sets of size $|V|/k$ ($\hat S(C)$ attains its maximum at this point) which
makes perfect sense if one expects clusters which have roughly equal size. An intermediate version
between the two is the Ratio Cheeger Cut which has a symmetric balancing function and strongly
penalizes overly large clusters. For the ease of presentation we restrict ourselves to these balancing
functions. However, we can also handle the corresponding weighted cases, e.g., $\hat S(C) = \mathrm{vol}(C) =
\sum_{i \in C} d_i$ where $d_i = \sum_{j=1}^n w_{ij}$, leading to the normalized cut [4].
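
To make the criterion (1) concrete, the following sketch evaluates BCut for the three balancing functions in (2) on a small graph. It is an illustration only: the helper names `balancing` and `bcut` and the toy graph (two triangles joined by a weak edge) are assumptions, not part of the paper.

```python
def balancing(name, C, n, k):
    """Balancing functions from (2); C is a set of vertices of a graph with n vertices."""
    size, comp = len(C), n - len(C)
    if name == "ratio":
        return size
    if name == "ratio_cheeger":
        return min(size, comp)
    if name == "asym_ratio_cheeger":
        return min((k - 1) * size, comp)
    raise ValueError("unknown balancing function: " + name)


def bcut(W, partition, name="ratio"):
    """Sum over the sets of cut(C_i, complement) / S(C_i), as in (1)."""
    n, k = len(W), len(partition)
    total = 0.0
    for C in partition:
        cut = sum(W[i][j] for i in C for j in range(n) if j not in C)
        total += cut / balancing(name, C, n, k)
    return total


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by a weak edge 2-3.
n = 6
W = [[0.0] * n for _ in range(n)]
for a, b, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[a][b] = W[b][a] = float(w)

good = [{0, 1, 2}, {3, 4, 5}]   # cuts only the weak edge
bad = [{0, 1}, {2, 3, 4, 5}]    # cuts two unit-weight edges
ratio_good = bcut(W, good, "ratio")
ratio_bad = bcut(W, bad, "ratio")
cheeger_good = bcut(W, good, "ratio_cheeger")
```

On this graph the partition along the weak edge has a much smaller Ratio Cut than the unbalanced alternative, which is the behaviour the criterion is designed to reward.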
3 Tight Continuous Relaxation for the Balanced k-Cut Problem

In this section we discuss our proposed relaxation for the balanced k-cut problem (1). It turns out
that a crucial question towards a tight multi-cut relaxation is the choice of the constraints so that
the continuous problem also yields a partition (together with a suitable rounding scheme). The
motivation for our relaxation is taken from the recent work of [?], where exact relaxations
are shown for the case k = 2. Basically, they replace the ratio of set functions with the ratio of
the corresponding Lovász extensions. We use the same idea for the objective of our continuous
relaxation of the k-cut problem (1) which is given as

\[
\min_{F=(F_1,\dots,F_k),\ F \in \mathbb{R}_+^{n \times k}} \ \sum_{l=1}^{k} \frac{\mathrm{TV}(F_l)}{S(F_l)} \tag{3}
\]
\[
\begin{aligned}
\text{subject to:}\quad & F_{(i)} \in \Delta_k, && i = 1,\dots,n, && \text{(simplex constraints)} \\
& \max\{F_{(i)}\} = 1, && \forall i \in I, && \text{(membership constraints)} \\
& S(F_l) \geq m, && l = 1,\dots,k, && \text{(size constraints)}
\end{aligned}
\]

where $S$ is the Lovász extension of the set function $\hat S$ and $m = \min_{C \subsetneq V,\, C \neq \emptyset} \hat S(C)$. We have
$m = 1$ for Ratio Cut and Ratio Cheeger Cut whereas $m = k - 1$ for Asymmetric Ratio Cheeger
Cut. Note that TV is the Lovász extension of the cut functional $\mathrm{cut}(C, \overline{C})$. In order to simplify
notation we denote for a matrix $F \in \mathbb{R}^{n \times k}$ by $F_l$ the $l$-th column of $F$ and by $F_{(i)}$ the $i$-th row
of $F$. Note that the rows of $F$ correspond to the vertices of the graph and the $j$-th column of $F$
corresponds to the set $C_j$ of the desired partition. The set $I \subset V$ in the membership constraints is
chosen adaptively by our method during the sequential optimization described in Section 4.

An obvious question is how to get from the continuous solution $F^*$ of (3) to a partition
$(C_1,\dots,C_k) \in P_k$, which is typically called rounding. Given $F^*$ we construct the sets by assigning
each vertex $i$ to the column where the $i$-th row attains its maximum. Formally,

\[
C_i = \{ j \in V \mid i = \operatorname*{arg\,max}_{s=1,\dots,k} F_{js} \}, \quad i = 1,\dots,k, \qquad \text{(Rounding)} \tag{4}
\]

where ties are broken randomly. If there exists a row such that the rounding is not unique, we say
that the solution is weakly degenerated. If furthermore the resulting sets $(C_1,\dots,C_k)$ do not form a
partition, that is one of the sets is empty, then we say that the solution is strongly degenerated.

First, we connect our relaxation to the previous work of [?] for the case k = 2. Indeed for a
symmetric balancing function such as the Ratio Cheeger Cut, our continuous relaxation (3) is exact even
without membership and size constraints.

Theorem 1 Let $\hat S$ be a non-negative symmetric balancing function, $\hat S(C) = \hat S(\overline{C})$, and denote by
$p^*$ the optimal value of (3) without membership and size constraints for $k = 2$. Then it holds

\[
p^* = \min_{(C_1,C_2) \in P_2} \ \sum_{i=1}^{2} \frac{\mathrm{cut}(C_i, \overline{C_i})}{\hat S(C_i)}.
\]

Furthermore there exists a solution $F^*$ of (3) such that $F^* = [\mathbf{1}_{C^*}, \mathbf{1}_{\overline{C^*}}]$, where $(C^*, \overline{C^*})$ is the
optimal balanced 2-cut partition.

Proof: Note that $\mathrm{cut}(C, \overline{C})$ is a symmetric set function and so is $\hat S$ by assumption. Thus with $C_2 = \overline{C_1}$,

\[
\frac{\mathrm{cut}(C_1, \overline{C_1})}{\hat S(C_1)} + \frac{\mathrm{cut}(C_2, \overline{C_2})}{\hat S(C_2)} = 2\, \frac{\mathrm{cut}(C_1, \overline{C_1})}{\hat S(C_1)}.
\]

Moreover, $\mathrm{TV}(\alpha f + \beta \mathbf{1}) = |\alpha|\, \mathrm{TV}(f)$ and by symmetry of $\hat S$ also $S(\alpha f + \beta \mathbf{1}) = |\alpha|\, S(f)$ (see
[?]). The simplex constraint implies that $F_2 = \mathbf{1} - F_1$ and thus

\[
\frac{\mathrm{TV}(F_2)}{S(F_2)} = \frac{\mathrm{TV}(\mathbf{1} - F_1)}{S(\mathbf{1} - F_1)} = \frac{\mathrm{TV}(F_1)}{S(F_1)}.
\]

Thus we can write problem (3) equivalently as

\[
\min_{f \in [0,1]^V} \ 2\, \frac{\mathrm{TV}(f)}{S(f)}.
\]

As for all $A \subset V$, $\mathrm{TV}(\mathbf{1}_A) = \mathrm{cut}(A, \overline{A})$ and $S(\mathbf{1}_A) = \hat S(A)$, we have

\[
\min_{f \in [0,1]^V} \frac{\mathrm{TV}(f)}{S(f)} \ \leq \ \min_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat S(C)}.
\]

However, it has been shown in [11] that $\min_{f \in \mathbb{R}^V} \frac{\mathrm{TV}(f)}{S(f)} = \min_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat S(C)}$ and that there exists
a continuous solution such that $f^* = \mathbf{1}_{C^*}$, where $C^* = \operatorname*{arg\,min}_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat S(C)}$. As $F^* = [f^*, \mathbf{1} - f^*] =
[\mathbf{1}_{C^*}, \mathbf{1}_{\overline{C^*}}]$ this finishes the proof. □

Note that rounding trivially yields a solution in the setting of the previous theorem. 
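
The rounding rule (4) and the two notions of degeneracy can be sketched as follows. This is an illustrative sketch only: the function name `round_to_partition`, the seeded random tie-breaking and the small matrices are assumptions. The function reports ties (the weakly degenerated case) and empty sets (the situation in which the rounded sets do not form a partition) as two separate flags.

```python
import random


def round_to_partition(F, rng=None):
    """Assign each vertex (row of F) to the column where the row is maximal.

    Returns (clusters, has_ties, has_empty_set). Ties are broken randomly.
    """
    rng = rng or random.Random(0)       # assumption: fixed seed for reproducibility
    k = len(F[0])
    clusters = [[] for _ in range(k)]
    has_ties = False
    for i, row in enumerate(F):
        best = max(row)
        winners = [l for l, v in enumerate(row) if v == best]
        if len(winners) > 1:
            has_ties = True             # rounding of this row is not unique
        clusters[rng.choice(winners)].append(i)
    has_empty_set = any(len(c) == 0 for c in clusters)
    return clusters, has_ties, has_empty_set


# Rows sum to one (simplex constraints); values are made up for illustration.
F_ok = [[1.0, 0.0, 0.0],
        [0.2, 0.7, 0.1],
        [0.0, 0.0, 1.0]]
F_degenerate = [[1.0, 0.0, 0.0],
                [0.0, 0.6, 0.4]]

clusters_ok, ties_ok, empty_ok = round_to_partition(F_ok)
clusters_deg, ties_deg, empty_deg = round_to_partition(F_degenerate)
```

In the second example the third column never attains a row maximum, so the third set stays empty and no 3-partition is obtained.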

A second result shows that indeed our proposed optimization problem (3) is a relaxation of the
balanced k-cut problem (1). Furthermore, the relaxation is exact if $I = V$.

Proposition 1 The continuous problem (3) is a relaxation of the k-cut problem (1). The relaxation
is exact, i.e., both problems are equivalent, if $I = V$.

Proof: For any k-way partition $(C_1,\dots,C_k)$, we can construct $F = (\mathbf{1}_{C_1},\dots,\mathbf{1}_{C_k})$. It obviously
satisfies the membership and size constraints and the simplex constraint is satisfied as $\cup_i C_i = V$
and $C_i \cap C_j = \emptyset$ if $i \neq j$. Thus $F$ is feasible for problem (3) and has the same objective value
because

\[
\mathrm{TV}(\mathbf{1}_C) = \mathrm{cut}(C, \overline{C}), \qquad S(\mathbf{1}_C) = \hat S(C).
\]

Thus problem (3) is a relaxation of (1).

If $I = V$, then the simplex together with the membership constraints imply that each row $F_{(i)}$
contains exactly one non-zero element which equals 1, i.e., $F \in \{0,1\}^{n \times k}$. Define for $l = 1,\dots,k$,
$C_l = \{i \in V \mid F_{il} = 1\}$ (i.e., $F_l = \mathbf{1}_{C_l}$), then it holds $\cup_l C_l = V$ and $C_l \cap C_j = \emptyset$, $l \neq j$.
From the size constraints, we have for $l = 1,\dots,k$, $0 < m \leq S(F_l) = S(\mathbf{1}_{C_l}) = \hat S(C_l)$. Thus
$\hat S(C_l) > 0$, $l = 1,\dots,k$, which by assumption on $\hat S$ implies that each $C_l$ is non-empty. Hence the
only feasible points allowed are indicators of k-way partitions and the equivalence of (1) and (3)
follows. □

The row-wise simplex and membership constraints enforce that each vertex in $I$ belongs to exactly
one component. Note that these constraints alone (even if $I = V$) still cannot guarantee that $F$
corresponds to a k-way partition since an entire column of $F$ can be zero. This is avoided by the
column-wise size constraints that enforce that each component has at least one vertex.

If $I = V$ it is immediate from the proof that problem (3) is no longer a continuous problem as the
feasible set consists only of indicator matrices of partitions. In this case rounding yields trivially a partition.
On the other hand, if $I = \emptyset$ (i.e., no membership constraints), and $k > 2$ it is not guaranteed
that rounding of the solution of the continuous problem yields a partition. Indeed, we will see in
the following that for symmetric balancing functions one can, under these conditions, show that
the solution is always strongly degenerated and rounding does not yield a partition (see Theorem
2). Thus we observe that the index set $I$ controls the degree to which the partition constraint is
enforced. The idea behind our suggested relaxation is that it is well known in image processing that
minimizing the total variation yields piecewise constant solutions (in fact this follows from seeing
the total variation as Lovász extension of the cut). Thus if $|I|$ is sufficiently large, the vertices where
the values are fixed to 0 or 1 propagate this to their neighboring vertices and finally to the whole
graph. We discuss the choice of $I$ in more detail in Section 4.1.


Simplex constraints alone are not sufficient to yield a partition: Our approach has been inspired
by [13] who proposed the following continuous relaxation for the Asymmetric Ratio Cheeger Cut

\[
\min_{F=(F_1,\dots,F_k),\ F \in \mathbb{R}_+^{n \times k}} \ \sum_{l=1}^{k} \frac{\mathrm{TV}(F_l)}{\| F_l - \mathrm{quant}_{k-1}(F_l) \|_{1,k-1}} \tag{5}
\]
\[
\text{subject to:}\quad F_{(i)} \in \Delta_k, \quad i = 1,\dots,n, \qquad \text{(simplex constraints)}
\]

where $S(f) = \| f - \mathrm{quant}_{k-1}(f) \|_{1,k-1}$ is the Lovász extension of $\hat S(C) = \min\{(k-1)|C|, |\overline{C}|\}$ and
$\mathrm{quant}_{k-1}(f)$ is the $k-1$-quantile of $f \in \mathbb{R}^n$. Note that in their approach no membership constraints
and size constraints are present.


We now show that the usage of simplex constraints in the optimization problem (3) is not sufficient
to guarantee that the solution $F^*$ can be rounded to a partition for any symmetric balancing function
in (1). For asymmetric balancing functions as employed for the Asymmetric Ratio Cheeger Cut by
[13] in their relaxation (5) we can prove such a strong result only in the case where the graph is
disconnected. However, note that if the number of components of the graph is less than the number
of desired clusters k, the multi-cut problem is still non-trivial.


Theorem 2 Let $\hat S(C)$ be any non-negative symmetric balancing function. Then the continuous
relaxation

\[
\min_{F=(F_1,\dots,F_k),\ F \in \mathbb{R}_+^{n \times k}} \ \sum_{l=1}^{k} \frac{\mathrm{TV}(F_l)}{S(F_l)} \tag{6}
\]
\[
\text{subject to:}\quad F_{(i)} \in \Delta_k, \quad i = 1,\dots,n, \qquad \text{(simplex constraints)}
\]

of the balanced k-cut problem (1) is void in the sense that the optimal solution $F^*$ of the continuous
problem can be constructed from the optimal solution of the 2-cut problem and $F^*$ cannot be
rounded into a k-way partition, see (4). If the graph is disconnected, then the same holds also for
any non-negative asymmetric balancing function.


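Proof: First, we derive a lower bound on the optimum of the continuous relaxation (6). Then we
construct a feasible point for (6) that achieves this lower bound but cannot yield a partitioning, thus
finishing the proof.

Let $(C^*, \overline{C^*}) = \operatorname*{arg\,min}_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat S(C)}$ be an optimal 2-way partition for the given graph. Using the exact
relaxation result for the balanced 2-cut problem in Theorem 3.1 in [11], we have

\[
\min_{F:\ F_{(i)} \in \Delta_k} \ \sum_{l=1}^{k} \frac{\mathrm{TV}(F_l)}{S(F_l)}
\ \geq\ \sum_{l=1}^{k} \min_{f \in \mathbb{R}^n} \frac{\mathrm{TV}(f)}{S(f)}
\ =\ \sum_{l=1}^{k} \min_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat S(C)}
\ =\ k\, \frac{\mathrm{cut}(C^*, \overline{C^*})}{\hat S(C^*)}.
\]

Now define $F_1 = \mathbf{1}_{C^*}$ and $F_l = \alpha_l \mathbf{1}_{\overline{C^*}}$, $l = 2,\dots,k$ such that $\sum_{l=2}^{k} \alpha_l = 1$, $\alpha_l > 0$. Clearly
$F = (F_1,\dots,F_k)$ is feasible for the problem (6) and the corresponding objective value is

\[
\frac{\mathrm{TV}(\mathbf{1}_{C^*})}{S(\mathbf{1}_{C^*})} + \sum_{l=2}^{k} \frac{\alpha_l\, \mathrm{TV}(\mathbf{1}_{\overline{C^*}})}{\alpha_l\, S(\mathbf{1}_{\overline{C^*}})}
\ =\ \sum_{l=1}^{k} \frac{\mathrm{cut}(C^*, \overline{C^*})}{\hat S(C^*)},
\]

where we used the 1-homogeneity of TV and $S$ [14] and the symmetry of cut and $\hat S$.

Thus the solution $F$ constructed as above from the 2-cut problem is indeed optimal for the
continuous relaxation (6) and it is not possible to obtain a k-way partition from this solution as there will
be $k - 2$ sets that are empty. Finally, the argument can be extended to asymmetric set functions if
there exists a set $C$ such that $\mathrm{cut}(C, \overline{C}) = 0$ as in this case it does not matter that $\hat S(C) \neq \hat S(\overline{C})$ in
order that the argument holds. □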
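
The construction used in this proof can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: the toy graph (two triangles joined by a weak edge), the choice of the Ratio Cheeger balancing function, the values $\alpha_2 = 0.6$, $\alpha_3 = 0.4$, and all function names are hypothetical. It finds the optimal 2-cut by brute force, builds $F$ from it for $k = 3$, and confirms that the objective equals $k$ times the optimal 2-cut ratio while rounding leaves one set empty.

```python
from itertools import combinations

n, k = 6, 3
W = [[0.0] * n for _ in range(n)]
for a, b, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[a][b] = W[b][a] = float(w)
V = range(n)


def cut(C):
    return sum(W[i][j] for i in C for j in V if j not in C)


def s_hat(C):
    return min(len(C), n - len(C))          # Ratio Cheeger balancing (symmetric)


def lovasz(set_fn, f):
    """Lovász extension via sorted removal of vertices (Definition 1)."""
    remaining, value = set(V), 0.0
    for j in sorted(V, key=lambda j: f[j]):
        before = set_fn(frozenset(remaining))
        remaining.discard(j)
        value += f[j] * (before - set_fn(frozenset(remaining)))
    return value


def tv(f):
    return 0.5 * sum(W[i][j] * abs(f[i] - f[j]) for i in V for j in V)


# Brute-force optimal balanced 2-cut (feasible only for tiny graphs).
subsets = [frozenset(c) for r in range(1, n) for c in combinations(V, r)]
C_star = min(subsets, key=lambda C: cut(C) / s_hat(C))
best_ratio = cut(C_star) / s_hat(C_star)

# F_1 = indicator of C*, F_l = alpha_l * indicator of the complement, l = 2, 3.
alphas = [0.6, 0.4]
cols = [[1.0 if i in C_star else 0.0 for i in V]]
cols += [[a if i not in C_star else 0.0 for i in V] for a in alphas]
objective = sum(tv(c) / lovasz(s_hat, c) for c in cols)

# Rounding (4): each vertex goes to the column with the largest entry.
labels = [max(range(k), key=lambda l: cols[l][i]) for i in V]
empty_sets = [l for l in range(k) if l not in labels]
```

Every row of the constructed matrix sums to one, so the simplex constraints hold, yet the third column never wins the row-wise maximum.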

The proof of Theorem 2 shows additionally that for any balancing function, if the graph is
disconnected, the solution of the continuous relaxation (6) is always zero, while clearly the solution of the
balanced k-cut problem need not be zero. This shows that the relaxation can be arbitrarily bad in
this case. In fact the relaxation for the asymmetric case can even fail if the graph is not disconnected
but there exists a cut of the graph which is very small, as the following corollary indicates.
Figure 1: Toy example illustrating that the relaxation of [13] converges to a degenerate solution
when applied to a graph with dominating 2-cut. (a) 10NN-graph generated from three Gaussians in
10 dimensions, (b) continuous solution of (5) from [13] for k = 3, (c) rounding of the continuous
solution of [13] does not yield a 3-partition, (d) continuous solution found by our method together
with the vertices $i \in I$ (black) where the membership constraint is enforced. Our continuous solution
corresponds already to a partition. (e) clustering found by rounding of our continuous solution
(trivial as we have converged to a partition). In (b)-(e), we color data point $i$ according to $F_{(i)} \in \mathbb{R}^3$.


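Corollary 1 Let $\hat S$ be an asymmetric balancing function and $C^* = \operatorname*{arg\,min}_{C \subset V} \frac{\mathrm{cut}(C, \overline{C})}{\hat S(C)}$ and suppose
that $\varphi^* := (k-1)\, \frac{\mathrm{cut}(C^*, \overline{C^*})}{\hat S(C^*)} + \frac{\mathrm{cut}(C^*, \overline{C^*})}{\hat S(\overline{C^*})} < \min_{(C_1,\dots,C_k) \in P_k} \sum_{i=1}^{k} \frac{\mathrm{cut}(C_i, \overline{C_i})}{\hat S(C_i)}$. Then there exists
a feasible $F$ with $F_1 = \mathbf{1}_{\overline{C^*}}$ and $F_l = \alpha_l \mathbf{1}_{C^*}$, $l = 2,\dots,k$ such that $\sum_{l=2}^{k} \alpha_l = 1$, $\alpha_l > 0$ for (6)
which has objective $\sum_{i=1}^{k} \frac{\mathrm{TV}(F_i)}{S(F_i)} = \varphi^*$ and which cannot be rounded to a k-way partition.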


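Proof: Let $F_1 = \mathbf{1}_{\overline{C^*}}$ and $F_l = \alpha_l \mathbf{1}_{C^*}$, $l = 2,\dots,k$ such that $\sum_{l=2}^{k} \alpha_l = 1$, $\alpha_l > 0$. Clearly
$F = (F_1,\dots,F_k)$ is feasible for the problem (6) and the corresponding objective value is

\[
\sum_{l=1}^{k} \frac{\mathrm{TV}(F_l)}{S(F_l)}
= \frac{\mathrm{TV}(\mathbf{1}_{\overline{C^*}})}{S(\mathbf{1}_{\overline{C^*}})} + \sum_{l=2}^{k} \frac{\alpha_l\, \mathrm{TV}(\mathbf{1}_{C^*})}{\alpha_l\, S(\mathbf{1}_{C^*})}
= \frac{\mathrm{cut}(\overline{C^*}, C^*)}{\hat S(\overline{C^*})} + \sum_{l=2}^{k} \frac{\mathrm{cut}(C^*, \overline{C^*})}{\hat S(C^*)},
\]

where we used the 1-homogeneity of TV and $S$ [14] and the symmetry of cut. This $F$ cannot be
rounded into a k-way partition as there will be $k - 2$ sets that are empty. □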

Theorem 2 shows that the membership and size constraints which we have introduced in our
relaxation (3) are essential to obtain a partition for symmetric balancing functions. For the asymmetric
balancing function, failure of the relaxation (6) and thus also of the relaxation (5) of [13] is only
guaranteed for disconnected graphs. However, Corollary 1 indicates that degenerated solutions should
also be a problem when the graph is still connected but there exists a dominating cut. We illustrate
this with a toy example in Figure 1 where the algorithm of [13] for solving (5) fails as it converges
exactly to the solution predicted by Corollary 1 and thus only produces a 2-partition instead of the
desired 3-partition. The algorithm for our relaxation enforcing membership constraints converges to
a continuous solution which is in fact a partition matrix so that no rounding is necessary.


4 Monotonic Descent Method for Minimization of a Sum of Ratios 


Apart from the new relaxation another key contribution of this paper is the derivation of an algorithm
which yields a sequence of feasible points for the difficult non-convex problem (3) and reduces
monotonically the corresponding objective. We would like to note that the algorithm proposed by
[13] for (5) does not yield monotonic descent. In fact it is unclear what the derived guarantee for
the algorithm in [13] implies for the generated sequence. Moreover, our algorithm works for any
non-negative submodular balancing function.
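The key insight in order to derive a monotonic descent method for solving the sum-of-ratios
minimization problem (3) is to eliminate the ratio by introducing a new set of variables $\beta = (\beta_1,\dots,\beta_k)$.

\[
\min_{F=(F_1,\dots,F_k),\ F \in \mathbb{R}_+^{n \times k},\ \beta \in \mathbb{R}_+^{k}} \ \sum_{l=1}^{k} \beta_l \tag{7}
\]
\[
\begin{aligned}
\text{subject to:}\quad & \mathrm{TV}(F_l) \leq \beta_l\, S(F_l), && l = 1,\dots,k, && \text{(descent constraints)} \\
& F_{(i)} \in \Delta_k, && i = 1,\dots,n, && \text{(simplex constraints)} \\
& \max\{F_{(i)}\} = 1, && \forall i \in I, && \text{(membership constraints)} \\
& S(F_l) \geq m, && l = 1,\dots,k. && \text{(size constraints)}
\end{aligned}
\]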


Note that for the optimal solution $(F^*, \beta^*)$ of this problem it holds $\mathrm{TV}(F_l^*) = \beta_l^* S(F_l^*)$, $l =
1,\dots,k$ (otherwise one can decrease $\beta_l^*$ and hence the objective) and thus equivalence holds. This
is still a non-convex problem as the descent, membership and size constraints are non-convex. Our
algorithm proceeds now in a sequential manner. At each iterate we do a convex inner approximation
of the constraint set, that is the convex approximation is a subset of the non-convex constraint set,
based on the current iterate $(F^t, \beta^t)$. Then we optimize the resulting convex optimization problem
and repeat the process. In this way we get a sequence of feasible points for the original problem (3)
for which we will prove monotonic descent in the sum-of-ratios.


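Convex approximation: As $\hat S$ is submodular, $S$ is convex. Let $s_l^t \in \partial S(F_l^t)$ be an element of the
sub-differential of $S$ at the current iterate $F_l^t$. We have by Prop. 3.2 in [14], $(s_l^t)_{j_i} = \hat S(C_{l_{i-1}}) -
\hat S(C_{l_i})$, where $j_i$ is the index of the $i$-th smallest component of $F_l^t$ and $C_{l_i} = \{j \in V \mid (F_l^t)_j >
(F_l^t)_i\}$. Moreover, using the definition of subgradient and $S(F_l^t) = \langle s_l^t, F_l^t \rangle$ (1-homogeneity of $S$),
we have $S(F_l) \geq S(F_l^t) + \langle s_l^t, F_l - F_l^t \rangle = \langle s_l^t, F_l \rangle$.

For the descent constraints, let $\lambda_l^t = \frac{\mathrm{TV}(F_l^t)}{S(F_l^t)}$ and introduce new variables $\delta_l = \beta_l - \lambda_l^t$ that capture
the amount of change in each ratio. We further decompose $\delta_l$ as $\delta_l = \delta_l^+ - \delta_l^-$, $\delta_l^+ \geq 0$, $\delta_l^- \geq 0$.
Let $M = \max_{f \in [0,1]^n} S(f) = \max_{C \subset V} \hat S(C)$, then for $S(F_l) \geq m$,

\[
\begin{aligned}
\mathrm{TV}(F_l) - \beta_l S(F_l) &\leq \mathrm{TV}(F_l) - \lambda_l^t \langle s_l^t, F_l \rangle - \delta_l^+ S(F_l) + \delta_l^- S(F_l) \\
&\leq \mathrm{TV}(F_l) - \lambda_l^t \langle s_l^t, F_l \rangle - \delta_l^+ m + \delta_l^- M .
\end{aligned}
\]

Finally, note that because of the simplex constraints, the membership constraints can be rewritten
as $\max\{F_{(i)}\} \geq 1$. Let $i \in I$ and define $j_i := \operatorname*{arg\,max}_j F_{ij}^t$ (ties are broken randomly). Since
$1 - \max\{F_{(i)}\} \leq 1 - F_{i j_i}$, the membership constraint $1 - \max\{F_{(i)}\} \leq 0$ is implied by the
convex constraint $1 - F_{i j_i} \leq 0$, that is $F_{i j_i} \geq 1$.
As $F_{ij} \leq 1$ we get $F_{i j_i} = 1$. Thus the convex approximation of the membership constraints
fixes the assignment of the $i$-th point to a cluster and thus can be interpreted as “label constraint”.
However, unlike the transductive setting, the labels for the vertices in $I$ are automatically chosen by
our method. The actual choice of the set $I$ will be discussed in Section 4.1. We use the notation
$L = \{(i, j_i) \mid i \in I\}$ for the label set generated from $I$ (note that $L$ is fixed once $I$ is fixed).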
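
The subgradient used in the convex approximation can be computed with the same sorting pass as the Lovász extension itself. The following sketch is illustrative only; the function names and the test vectors are assumptions, and the Ratio Cheeger balancing function is used as an example of a submodular set function. It checks the two properties the derivation relies on: equality $S(F_l^t) = \langle s_l^t, F_l^t \rangle$ at the current iterate and the lower bound $S(g) \geq \langle s_l^t, g \rangle$ at other points.

```python
def lovasz_and_subgradient(set_fn, f):
    """Return (S(f), s) where s is the sorting-based subgradient of the Lovász extension at f."""
    n = len(f)
    remaining = set(range(n))
    s = [0.0] * n
    for j in sorted(range(n), key=lambda j: f[j]):     # increasing order of f
        before = set_fn(frozenset(remaining))
        remaining.discard(j)
        s[j] = before - set_fn(frozenset(remaining))   # difference of consecutive level sets
    value = sum(sj * fj for sj, fj in zip(s, f))
    return value, s


n = 4


def ratio_cheeger(C):
    return float(min(len(C), n - len(C)))              # submodular balancing function


f_t = [0.9, 0.1, 0.4, 0.7]                              # hypothetical current iterate (one column)
S_ft, s_t = lovasz_and_subgradient(ratio_cheeger, f_t)

# Other points at which the linear lower bound <s_t, g> <= S(g) is checked.
others = [[0.2, 0.8, 0.5, 0.3], [1.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5], [0.0, 1.0, 0.3, 0.6]]
gaps = []
for g in others:
    S_g, _ = lovasz_and_subgradient(ratio_cheeger, g)
    gaps.append(S_g - sum(sj * gj for sj, gj in zip(s_t, g)))
```

All gaps are non-negative, which is exactly why replacing $S(F_l)$ by $\langle s_l^t, F_l \rangle$ in the size and descent constraints yields an inner approximation.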


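Descent algorithm: Our descent algorithm for minimizing (3) solves at each iteration $t$ the following
convex optimization problem (8).

\[
\min_{F \in \mathbb{R}_+^{n \times k},\ \delta^+ \in \mathbb{R}_+^{k},\ \delta^- \in \mathbb{R}_+^{k}} \ \sum_{l=1}^{k} \left( \delta_l^+ - \delta_l^- \right) \tag{8}
\]
\[
\begin{aligned}
\text{subject to:}\quad & \mathrm{TV}(F_l) \leq \lambda_l^t \langle s_l^t, F_l \rangle + \delta_l^+ m - \delta_l^- M, && l = 1,\dots,k, && \text{(descent constraints)} \\
& F_{(i)} \in \Delta_k, && i = 1,\dots,n, && \text{(simplex constraints)} \\
& F_{i j_i} = 1, && \forall (i, j_i) \in L, && \text{(label constraints)} \\
& \langle s_l^t, F_l \rangle \geq m, && l = 1,\dots,k. && \text{(size constraints)}
\end{aligned}
\]

As its solution $F^{t+1}$ is feasible for (3) we update $\lambda_l^{t+1} = \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})}$ and $s_l^{t+1} \in \partial S(F_l^{t+1})$, $l =
1,\dots,k$ and repeat the process until the sequence terminates, that is no further descent is possible as
the following theorem states, or the relative descent in $\sum_{l=1}^{k} \lambda_l^t$ is smaller than a predefined $\epsilon$. The
following Theorem 3 shows the monotonic descent property of our algorithm.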
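Theorem 3 The sequence $\{F^t\}$ produced by the above algorithm satisfies
$\sum_{l=1}^{k} \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})} < \sum_{l=1}^{k} \frac{\mathrm{TV}(F_l^{t})}{S(F_l^{t})}$ for all $t \geq 0$ or the algorithm terminates.

Proof: Let $(F^{t+1}, \delta^{+,t+1}, \delta^{-,t+1})$ be the optimal solution of the inner problem (8). By the
feasibility of $(F^{t+1}, \delta^{+,t+1}, \delta^{-,t+1})$ and $S(F_l^{t+1}) \geq m$,

\[
\begin{aligned}
\frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})}
&\leq \frac{\lambda_l^t \langle s_l^t, F_l^{t+1} \rangle + m\, \delta_l^{+,t+1} - M\, \delta_l^{-,t+1}}{S(F_l^{t+1})} \\
&\leq \lambda_l^t + \frac{m\, \delta_l^{+,t+1}}{S(F_l^{t+1})} - \frac{M\, \delta_l^{-,t+1}}{S(F_l^{t+1})}
\ \leq\ \lambda_l^t + \delta_l^{+,t+1} - \delta_l^{-,t+1}.
\end{aligned}
\]

Summing over all ratios, we have

\[
\sum_{l=1}^{k} \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})} \ \leq\ \sum_{l=1}^{k} \lambda_l^t + \sum_{l=1}^{k} \left( \delta_l^{+,t+1} - \delta_l^{-,t+1} \right).
\]

Noting that $\delta_l^+ = \delta_l^- = 0$, $F = F^t$ is feasible for (8), the optimal value $\sum_{l=1}^{k} \left( \delta_l^{+,t+1} - \delta_l^{-,t+1} \right)$
has to be either strictly negative, in which case we have strict descent

\[
\sum_{l=1}^{k} \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})} \ <\ \sum_{l=1}^{k} \lambda_l^t ,
\]

or the previous iterate $F^t$ together with $\delta_l^+ = \delta_l^- = 0$ is already optimal and hence the algorithm
terminates. □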


The inner problem (8) is convex, but contains the non-smooth term TV in the constraints. We
eliminate the non-smoothness by introducing additional variables and derive an equivalent linear
programming (LP) formulation. We solve this LP via the PDHG algorithm [15, 16]. The LP and the
exact iterates can be found in the supplementary material.


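Lemma 1 The convex inner problem (8) is equivalent to the following linear optimization problem
where $E$ is the set of edges of the graph and $w \in \mathbb{R}^{|E|}$ are the edge weights.

\[
\min_{F \in \mathbb{R}_+^{n \times k},\ \alpha \in \mathbb{R}_+^{|E| \times k},\ \delta^+ \in \mathbb{R}_+^{k},\ \delta^- \in \mathbb{R}_+^{k}} \ \sum_{l=1}^{k} \left( \delta_l^+ - \delta_l^- \right) \tag{9}
\]
\[
\begin{aligned}
\text{subject to:}\quad & \langle w, \alpha_l \rangle \leq \lambda_l^t \langle s_l^t, F_l \rangle + \delta_l^+ m - \delta_l^- M, && l = 1,\dots,k, && \text{(descent constraints)} \\
& F_{(i)} \in \Delta_k, && i = 1,\dots,n, && \text{(simplex constraints)} \\
& F_{i j_i} = 1, && \forall (i, j_i) \in L, && \text{(label constraints)} \\
& \langle s_l^t, F_l \rangle \geq m, && l = 1,\dots,k, && \text{(size constraints)} \\
& -(\alpha_l)_{ij} \leq F_{il} - F_{jl} \leq (\alpha_l)_{ij}, && l = 1,\dots,k, && \forall (i,j) \in E.
\end{aligned}
\]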

Proof: We define new variables $\alpha_l \in \mathbb{R}^{|E|}$ for each column $l$ and introduce constraints
$(\alpha_l)_{ij} = |(F_l)_i - (F_l)_j|$, which allows us to rewrite $\mathrm{TV}(F_l)$ as $\langle w, \alpha_l \rangle$. These equality constraints
can be replaced by the inequality constraints $(\alpha_l)_{ij} \geq |(F_l)_i - (F_l)_j|$ without changing the
optimal value of the problem, because at an optimum these constraints can be assumed to be active: otherwise one
can decrease $(\alpha_l)_{ij}$ while still being feasible since $w$ is non-negative. Finally, these inequality
constraints are rewritten using the fact that $|x| \leq y \Leftrightarrow -y \leq x \leq y$, for $y \geq 0$. □
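
The role of the auxiliary variables in this proof can be illustrated with a short computation. In the sketch below (an illustration, with a made-up edge list, weights and vector), the smallest $\alpha$ satisfying $-\alpha_{ij} \leq f_i - f_j \leq \alpha_{ij}$ is $|f_i - f_j|$, and with that choice $\langle w, \alpha \rangle$ equals the total variation; any other feasible $\alpha$ can only give a larger value because the weights are non-negative.

```python
def smallest_feasible_alpha(edges, f):
    """Smallest alpha with -alpha_ij <= f_i - f_j <= alpha_ij for every edge (i, j)."""
    return [abs(f[i] - f[j]) for (i, j) in edges]


def is_feasible(edges, f, alpha):
    return all(-a <= f[i] - f[j] <= a for (i, j), a in zip(edges, alpha))


def inner(w, alpha):
    return sum(wi * ai for wi, ai in zip(w, alpha))


# Hypothetical toy graph: each undirected edge listed once with its weight.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
w = [1.0, 2.0, 3.0, 1.0]
f = [0.2, 0.9, 0.5, 0.1]

alpha_min = smallest_feasible_alpha(edges, f)
tv_value = inner(w, alpha_min)                    # total variation of f on this graph

alpha_slack = [a + 0.25 for a in alpha_min]       # still feasible, but not tight
slack_value = inner(w, alpha_slack)
```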


4.0.1 Solving LP via PDHG

Recently, first-order primal-dual hybrid gradient descent (PDHG for short) methods have been
proposed [15, 16] to efficiently solve a class of convex optimization problems that can be rewritten as
the following saddle-point problem

\[
\min_{x \in X} \max_{y \in Y} \ \langle Ax, y \rangle + G(x) - \Phi^*(y),
\]

where $X$ and $Y$ are finite-dimensional vector spaces, $A : X \to Y$ is a linear operator and $G$ and
$\Phi^*$ are convex functions. It has been shown that the PDHG algorithm achieves good performance
in solving huge linear programming problems that appear in computer vision applications. We now
show how the linear programming problem

\[
\min_{x \geq 0} \ \langle c, x \rangle \quad \text{subject to:} \quad A_1 x \leq b_1, \quad A_2 x = b_2
\]

can be rewritten as a saddle-point problem so that PDHG can be applied.

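By introducing the Lagrange multipliers $y$, the optimal value of the LP can be written as

\[
\begin{aligned}
& \min_{x \geq 0} \ \langle c, x \rangle + \max_{y_1 \geq 0,\ y_2} \ \langle y_1, A_1 x - b_1 \rangle + \langle y_2, A_2 x - b_2 \rangle \\
&= \min_{x} \max_{y_1,\ y_2} \ \langle c, x \rangle + \iota_{x \geq 0}(x) + \langle y_1, A_1 x \rangle + \langle y_2, A_2 x \rangle - \langle b_1, y_1 \rangle - \langle b_2, y_2 \rangle - \iota_{y_1 \geq 0}(y_1),
\end{aligned}
\]

where $\iota_{\cdot \geq 0}$ is the indicator function that takes a value of 0 on the non-negative orthant and $\infty$
elsewhere.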

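Define $b = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$, $A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}$ and $y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$. Then the saddle point problem
corresponding to the LP is given by

\[
\min_{x} \max_{y_1,\ y_2} \ \langle c, x \rangle + \iota_{x \geq 0}(x) + \langle y, Ax \rangle - \langle b, y \rangle - \iota_{y_1 \geq 0}(y_1).
\]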


The primal and dual iterates for this saddle-point problem can be obtained as

\[
\begin{aligned}
x^{r+1} &= \max\{0,\ x^r - \tau (A^T y^r + c)\}, \\
y_1^{r+1} &= \max\{0,\ y_1^r + \sigma (A_1 \bar{x}^{r+1} - b_1)\}, \\
y_2^{r+1} &= y_2^r + \sigma (A_2 \bar{x}^{r+1} - b_2),
\end{aligned}
\]

where $\bar{x}^{r+1} = 2x^{r+1} - x^r$. Here the primal and dual step sizes $\tau$ and $\sigma$ are chosen such that
$\tau \sigma \|A\|^2 < 1$, where $\|\cdot\|$ denotes the operator norm.
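
The three updates above are short enough to sketch in plain Python. The following is a minimal illustration with global step sizes, not the implementation used in the paper: the function name `pdhg_lp`, the iteration count, and the small LP (two variables plus one slack variable, optimum at $x = (3, 1, 0)$ with value $-5$) are assumptions chosen so that the result can be verified by hand. For this matrix $\|A\|^2 \approx 12.5$, so $\tau = \sigma = 0.25$ satisfies $\tau\sigma\|A\|^2 < 1$.

```python
def pdhg_lp(c, A1, b1, A2, b2, tau, sigma, iters=20000):
    """PDHG for min <c, x> s.t. A1 x <= b1, A2 x = b2, x >= 0 (global step sizes)."""
    n, m1, m2 = len(c), len(b1), len(b2)
    x, y1, y2 = [0.0] * n, [0.0] * m1, [0.0] * m2
    for _ in range(iters):
        # Primal step: projected gradient step on x using A^T y + c.
        grad = [c[j]
                + sum(A1[i][j] * y1[i] for i in range(m1))
                + sum(A2[i][j] * y2[i] for i in range(m2))
                for j in range(n)]
        x_new = [max(0.0, x[j] - tau * grad[j]) for j in range(n)]
        # Extrapolation: x_bar = 2 x^{r+1} - x^r.
        x_bar = [2.0 * x_new[j] - x[j] for j in range(n)]
        # Dual steps: projected ascent for inequalities, plain ascent for equalities.
        y1 = [max(0.0, y1[i] + sigma * (sum(A1[i][j] * x_bar[j] for j in range(n)) - b1[i]))
              for i in range(m1)]
        y2 = [y2[i] + sigma * (sum(A2[i][j] * x_bar[j] for j in range(n)) - b2[i])
              for i in range(m2)]
        x = x_new
    return x, y1, y2


# Hypothetical toy LP: min -x1 - 2 x2  s.t.  x1 + x2 <= 4,  x1 + 3 x2 + x3 = 6,  x >= 0.
c = [-1.0, -2.0, 0.0]
A1, b1 = [[1.0, 1.0, 0.0]], [4.0]
A2, b2 = [[1.0, 3.0, 1.0]], [6.0]

x_opt, y1_opt, y2_opt = pdhg_lp(c, A1, b1, A2, b2, tau=0.25, sigma=0.25)
objective = sum(cj * xj for cj, xj in zip(c, x_opt))
```

Each iteration costs one multiplication with $A$ and one with $A^T$, which is what makes the scheme attractive for large sparse LPs.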

Instead of the global step sizes $\tau$ and $\sigma$, we use in our implementation the diagonal preconditioning
matrices introduced in [16] as it is shown to improve the practical performance of PDHG. The
diagonal elements of these preconditioning matrices $\tau$ and $\sigma$ are given by

\[
\tau_j = \frac{1}{\sum_{i=1}^{n_r} |A_{ij}|}, \ \forall j \in \{1,\dots,n_c\}, \qquad \sigma_i = \frac{1}{\sum_{j=1}^{n_c} |A_{ij}|}, \ \forall i \in \{1,\dots,n_r\},
\]

where $n_r$, $n_c$ are the number of rows and the number of columns of the matrix $A$.
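
As a small illustration of diagonal preconditioning, the sketch below computes one step size per column and per row of $A$ as the reciprocal of the corresponding absolute column or row sum. This particular form is an assumption on our part about the variant of [16] that is meant here, and the function name and the example matrix are hypothetical; the sketch also assumes that $A$ has no all-zero row or column.

```python
def diagonal_preconditioners(A):
    """Per-column primal steps tau_j and per-row dual steps sigma_i from absolute sums.

    Assumption: every row and every column of A has at least one non-zero entry.
    """
    n_r, n_c = len(A), len(A[0])
    tau = [1.0 / sum(abs(A[i][j]) for i in range(n_r)) for j in range(n_c)]
    sigma = [1.0 / sum(abs(A[i][j]) for j in range(n_c)) for i in range(n_r)]
    return tau, sigma


# Stacked constraint matrix of a toy LP (illustration only).
A = [[1.0, 1.0, 0.0],
     [1.0, 3.0, 1.0]]
tau, sigma = diagonal_preconditioners(A)
```

Variables that appear in few constraints receive larger steps than variables that appear in many, which is the practical benefit over a single global step size.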

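For completeness, we now present the explicit form of the primal and dual iterates of the
preconditioned PDHG for the LP (9). Let $\theta \in \mathbb{R}^k$, $\mu \in \mathbb{R}^n$, $\zeta \in \mathbb{R}^{|L|}$, $\nu \in \mathbb{R}^k$, $\eta_l \in \mathbb{R}^{|E|}$, $\xi_l \in \mathbb{R}^{|E|}$, $\forall l \in
\{1,\dots,k\}$ be the Lagrange multipliers corresponding to the descent, simplex, label, size and the
two sets of additional constraints (introduced to eliminate the non-smoothness) respectively. Let
$B : \mathbb{R}^{|E|} \to \mathbb{R}^n$ be a linear mapping defined as $(Bz)_i = \sum_{j : (i,j) \in E} z_{ij} - \sum_{j : (j,i) \in E} z_{ji}$ and let $\mathbf{1}_n$ denote
a vector of all ones. Then the primal iterates for the LP (9) are given by

\[
\begin{aligned}
F_l^{r+1} &= \max\left\{0,\ F_l^r - \tau_{F,l} \left( -(\theta_l^r \lambda_l^t + \nu_l^r)\, s_l^t + \mu^r + Z_l^r + B(\eta_l^r - \xi_l^r) \right) \right\}, && \forall l \in \{1,\dots,k\}, \\
\alpha_l^{r+1} &= \max\left\{0,\ \alpha_l^r - \tau_{\alpha,l} \left( \theta_l^r w - \eta_l^r - \xi_l^r \right) \right\}, && \forall l \in \{1,\dots,k\}, \\
\delta^{+,r+1} &= \max\left\{0,\ \delta^{+,r} - \tau_{\delta^+} \left( -m\, \theta^r + \mathbf{1}_k \right) \right\}, \\
\delta^{-,r+1} &= \max\left\{0,\ \delta^{-,r} - \tau_{\delta^-} \left( M \theta^r - \mathbf{1}_k \right) \right\},
\end{aligned}
\]

where $Z_l^r \in \mathbb{R}^n$, $l = 1,\dots,k$, are given by $(Z_l^r)_i = \zeta_{il}^r$ if $(i,l) \in L$ and 0 otherwise. Here
$\tau_{F,l}$, $\tau_{\alpha,l}$, $\tau_{\delta^+}$, $\tau_{\delta^-}$ are the diagonal preconditioning matrices whose diagonal elements are given
by

\[
\begin{aligned}
(\tau_{F,l})_i &= \frac{1}{(1 + \lambda_l^t)\, |(s_l^t)_i| + 2 d_i + \rho_{il} + 1}, && \forall i \in \{1,\dots,n\}, \\
(\tau_{\alpha,l})_{ij} &= \frac{1}{w_{ij} + 2}, && \forall (i,j) \in E, \\
(\tau_{\delta^+})_l &= \frac{1}{m}, \qquad (\tau_{\delta^-})_l = \frac{1}{M}, && \forall l \in \{1,\dots,k\},
\end{aligned}
\]

where $d_i$ is the number of vertices adjacent to the $i$-th vertex and $\rho_{il} = 1$ if $(i,l) \in L$ and 0
otherwise.

With $\bar{F}^{r+1} = 2F^{r+1} - F^r$ (and $\bar{\alpha}^{r+1}$, $\bar{\delta}^{+,r+1}$, $\bar{\delta}^{-,r+1}$ defined analogously), the dual iterates are given by

\[
\begin{aligned}
\theta_l^{r+1} &= \max\left\{0,\ \theta_l^r + \sigma_{\theta,l} \left( \langle w, \bar{\alpha}_l^{r+1} \rangle - \lambda_l^t \langle s_l^t, \bar{F}_l^{r+1} \rangle - m\, \bar{\delta}_l^{+,r+1} + M\, \bar{\delta}_l^{-,r+1} \right) \right\}, && l = 1,\dots,k, \\
\mu^{r+1} &= \mu^r + \sigma_{\mu} \left( \bar{F}^{r+1} \mathbf{1}_k - \mathbf{1}_n \right), \\
\zeta_{il}^{r+1} &= \zeta_{il}^r + \sigma_{\zeta} \left( \bar{F}_{il}^{r+1} - 1 \right), && \forall (i,l) \in L, \\
\nu_l^{r+1} &= \max\left\{0,\ \nu_l^r + \sigma_{\nu,l} \left( m - \langle s_l^t, \bar{F}_l^{r+1} \rangle \right) \right\}, && \forall l \in \{1,\dots,k\}, \\
\eta_l^{r+1} &= \max\left\{0,\ \eta_l^r + \sigma_{\eta,l} \left( -\bar{\alpha}_l^{r+1} + B^T \bar{F}_l^{r+1} \right) \right\}, && \forall l \in \{1,\dots,k\}, \\
\xi_l^{r+1} &= \max\left\{0,\ \xi_l^r + \sigma_{\xi,l} \left( -\bar{\alpha}_l^{r+1} - B^T \bar{F}_l^{r+1} \right) \right\}, && \forall l \in \{1,\dots,k\},
\end{aligned}
\]

where

\[
\sigma_{\theta,l} = \frac{1}{\sum_{(i,j) \in E} w_{ij} + \lambda_l^t \sum_{i=1}^{n} |(s_l^t)_i| + m + M}, \qquad
\sigma_{\mu} = \frac{1}{k}, \qquad \sigma_{\zeta} = 1, \qquad
\sigma_{\nu,l} = \frac{1}{\sum_{i=1}^{n} |(s_l^t)_i|},
\]

and $\sigma_{\eta,l}$, $\sigma_{\xi,l}$ are the diagonal preconditioning matrices whose diagonal elements are given by

\[
(\sigma_{\eta,l})_{ij} = (\sigma_{\xi,l})_{ij} = \frac{1}{3}, \qquad \forall (i,j) \in E.
\]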


From the iterates, one sees that the computational cost per iteration is $O(|E|)$. In our
implementation, we further reformulated the LP (9) by directly integrating the label constraints, thereby reducing
the problem size and getting rid of the dual variable of the label constraints.


4.1 Choice of membership constraints I


The overall algorithm scheme for solving the problem (1) is given in the supplementary material. For
the membership constraints we start initially with $I^0 = \emptyset$ and sequentially solve the inner problem
(8). From its solution $F^{t+1}$ we construct a $P_k' = (C_1,\dots,C_k)$ via rounding, see (4). We repeat this
process until we either do not improve the resulting balanced k-cut or $P_k'$ is not a partition. In this
case we update $I^{t+1}$ and double the number of membership constraints. Let $(C_1^*,\dots,C_k^*)$ be the
current best partition. For each $l \in \{1,\dots,k\}$ and $i \in C_l^*$ we compute

\[
b_{li} = \frac{\mathrm{cut}\big(C_l^* \setminus \{i\},\ \overline{C_l^*} \cup \{i\}\big)}{\hat S(C_l^* \setminus \{i\})}
+ \min_{s \neq l} \left[ \frac{\mathrm{cut}\big(C_s^* \cup \{i\},\ \overline{C_s^*} \setminus \{i\}\big)}{\hat S(C_s^* \cup \{i\})}
+ \sum_{j \neq l,\ j \neq s} \frac{\mathrm{cut}(C_j^*, \overline{C_j^*})}{\hat S(C_j^*)} \right] \tag{10}
\]

and define $O_l = \{(\pi_1,\dots,\pi_{|C_l^*|}) \mid b_{l\pi_1} \geq b_{l\pi_2} \geq \dots \geq b_{l\pi_{|C_l^*|}}\}$. The top-ranked vertices in
$O_l$ correspond to the ones which lead to the largest minimal increase in BCut when moved from
$C_l^*$ to another component and thus are most likely to belong to their current component. Thus it
is natural to fix the top-ranked vertices for each component first. Note that the rankings $O_l$, $l =
1,\dots,k$ are updated when a better partition is found. Thus the membership constraints correspond
always to the vertices which lead to largest minimal increase in BCut when moved to another
component. In Figure 1 one can observe that the fixed labeled points are lying close to the centers
of the found clusters. The number of membership constraints depends on the graph. The better
separated the clusters are, the fewer membership constraints need to be enforced in order to avoid
degenerate solutions. Finally, we stop the algorithm if we see no more improvement in the cut or
the continuous objective and the continuous solution corresponds to a partition.
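
The ranking used to select membership constraints can be sketched for the Ratio Cut as follows. This is an illustration under assumptions: the names `ratio` and `move_scores`, the restriction to the Ratio Cut balancing function, and the toy graph are not from the paper. For each vertex the score is the balanced cut obtained after moving it to the best other component; vertices are then ranked by decreasing score within their component.

```python
def ratio(W, C):
    """cut(C, complement of C) / |C| (Ratio Cut balancing)."""
    n = len(W)
    return sum(W[i][j] for i in C for j in range(n) if j not in C) / len(C)


def move_scores(W, partition):
    """Score of each vertex: value obtained when it is moved to the best other set."""
    k = len(partition)
    base = [ratio(W, C) for C in partition]
    scores = {}
    for l, C_l in enumerate(partition):
        if len(C_l) < 2:
            continue                     # moving the only vertex would empty the set
        for i in C_l:
            shrunk = ratio(W, C_l - {i})
            grown = min(ratio(W, partition[s] | {i})
                        + sum(base[j] for j in range(k) if j != l and j != s)
                        for s in range(k) if s != l)
            scores[(l, i)] = shrunk + grown
    return scores


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by a weak edge 2-3.
n = 6
W = [[0.0] * n for _ in range(n)]
for a, b, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[a][b] = W[b][a] = float(w)

partition = [frozenset({0, 1, 2}), frozenset({3, 4, 5})]
scores = move_scores(W, partition)
ranking_0 = sorted(partition[0], key=lambda i: -scores[(0, i)])   # top-ranked first
```

On this example the vertex attached to the bridge edge is ranked last in its component, so interior vertices would be fixed first, in line with the observation that the fixed points lie near the cluster centers.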


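Algorithm 1 for solving the problem

Initialization: $F^0$ such that $F^0 \mathbf{1}_k = \mathbf{1}_n$, $\lambda_l^0 = \frac{\mathrm{TV}(F_l^0)}{S(F_l^0)}$, $l = 1, \ldots, k$, $I^0 = \emptyset$, $L = \emptyset$, $p = 0$
Output: partition $(C_1^*, \ldots, C_k^*)$
repeat
    let $F^{t+1}$ be the optimal solution of the inner problem
    $\lambda_l^{t+1} = \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})}$, $l = 1, \ldots, k$
    $(C_1^{t+1}, \ldots, C_k^{t+1})$ is obtained from $F^{t+1}$ via rounding
    if the balanced cut of $(C_1^{t+1}, \ldots, C_k^{t+1})$ improves on the best cut found so far and $(C_1^{t+1}, \ldots, C_k^{t+1})$ is a $k$-partition then
        compute new ordering $O_l$, $\forall l = 1, \ldots, k$, for the new best partition $(C_1^*, \ldots, C_k^*)$ according to (10)
        $I^{t+1} = \bigcup_{l=1}^{k} O_l^p$, where $O_l^p$ denotes the $p$ top-ranked vertices in $O_l$
        $L = \{(i, \arg\max_j F_{ij}^{t+1}) \mid i \in I^{t+1}\}$
    else
        $p = \max\{2\,|I^t|, 1\}$ (double the number of membership constraints)
        $I^{t+1} = \bigcup_{l=1}^{k} O_l^p$, where $O_l^p$ denotes the $p$ top-ranked vertices in $O_l$
        $L = \{(i, \arg\max_j F_{ij}^{t}) \mid i \in I^{t+1}\}$
        $F^{t+1} = F^t$, $F_{ij}^{t+1} = 0$, $\forall i \in I^{t+1}$, $\forall j \in \{1, \ldots, k\}$, $F_{ij}^{t+1} = 1$, $\forall (i, j) \in L$
        $\lambda_l^{t+1} = \frac{\mathrm{TV}(F_l^{t+1})}{S(F_l^{t+1})}$, $l = 1, \ldots, k$
    end if
until there is no more improvement in the cut or the continuous objective and the continuous solution corresponds to a partition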
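The bookkeeping around the inner problem (rounding, the partition check, and the construction of the membership and label sets) can be sketched as follows. This is a hedged illustration only: the rounding rule is defined elsewhere in the paper and is assumed here to be a row-wise argmax, a "$k$-partition" is taken to mean that every one of the $k$ clusters is non-empty, and all function names are hypothetical.

```python
def argmax_row(row):
    """Index of the largest entry; the first maximum wins on ties."""
    return max(range(len(row)), key=lambda c: row[c])


def round_to_clusters(F):
    """Assumed rounding: vertex i goes to the cluster with the largest F[i][j]."""
    k = len(F[0])
    clusters = [set() for _ in range(k)]
    for i, row in enumerate(F):
        clusters[argmax_row(row)].add(i)
    return clusters


def is_k_partition(clusters):
    """Assumed check: all k clusters must be non-empty."""
    return all(len(C) > 0 for C in clusters)


def top_ranked(orderings, p):
    """Membership set: union of the p top-ranked vertices of every ordering O_l."""
    fixed = set()
    for order in orderings:
        fixed.update(order[:p])
    return fixed


def label_pairs(F, fixed):
    """Label set L: each fixed vertex paired with its argmax cluster."""
    return {(i, argmax_row(F[i])) for i in fixed}


def apply_constraints(F, pairs):
    """Copy F and turn the rows of fixed vertices into indicator rows."""
    G = [list(row) for row in F]
    for i, j in pairs:
        G[i] = [0.0] * len(G[i])
        G[i][j] = 1.0
    return G


F = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
clusters = round_to_clusters(F)
fixed = top_ranked([[0, 1], [2]], p=1)
print(clusters, is_k_partition(clusters), apply_constraints(F, label_pairs(F, fixed)))
```

A degenerate continuous solution, for example one where every row has its maximum in the same column, fails `is_k_partition`, which is the situation in which the algorithm enforces more membership constraints.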


5 Experiments 

We evaluate our method against a diverse selection of state-of-the-art clustering methods like
spectral clustering (Spec) [7], BSpec [11], Graclus¹ [6], NMF based approaches PNMF [19], NSC [20],
ONMF [21], LSD [22], NMFR [23] and MTV [13]. We used the publicly available code [23, 13] with default
settings. We run our method using 5 random initializations and 7 initializations based on the spectral
clustering solution, similar to [13] (who use 30 such initializations). In addition to the datasets
provided in [13], we also selected a variety of datasets from the UCI repository shown below. For all
the datasets not in [13], symmetric $k$-NN graphs are built with Gaussian weights whose width is set by
a parameter $s$ and by $\sigma_{x,k}$, the $k$-NN distance of point $x$. We chose the parameters $s$
and $k$ in a method-independent way by testing for each dataset several graphs using all the methods
over different choices of $k \in \{3, 5, 7, 10, 15, 20, 40, 60, 80, 100\}$ and $s \in \{0.1, 1, 4\}$.
The best choice in terms of the clustering error across all the methods and datasets is $s = 1$,
$k = 15$.

            Iris  wine  vertebral  ecoli  4moons  webkb4  optdigits  USPS  pendigits  20news  MNIST
# vertices   150   178        310    336    4000    4196       5620  9298      10992   19928  70000
# classes      3     3          3      6       4       4         10    10         10      20     10
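The graph construction described above can be sketched in plain Python. The exact weight formula is not recoverable from the text here, so the sketch assumes the form $w_{xy} = \exp\big(-s\,\|x - y\|^2 / \min\{\sigma_{x,k}^2, \sigma_{y,k}^2\}\big)$ and an "or"-symmetrization (an edge exists if either point is among the $k$ nearest neighbours of the other). Both choices, as well as the function name `knn_gaussian_graph`, are assumptions made for illustration.

```python
import math


def knn_gaussian_graph(points, k=15, s=1.0):
    """Dense symmetric k-NN graph with Gaussian weights (assumed formula, see lead-in)."""
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    # Squared Euclidean distances between all pairs of points.
    d2 = [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
           for j in range(n)] for i in range(n)]
    neighbors, sigma2 = [], []
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: d2[i][j])[:k]
        neighbors.append(nearest)
        sigma2.append(d2[i][nearest[-1]])  # squared distance to the k-th neighbour
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            scale = max(min(sigma2[i], sigma2[j]), 1e-12)  # guard against duplicate points
            W[i][j] = W[j][i] = math.exp(-s * d2[i][j] / scale)
    return W


# Parameter grid from the text: 10 values of k times 3 values of s.
k_values = [3, 5, 7, 10, 15, 20, 40, 60, 80, 100]
s_values = [0.1, 1, 4]
grid = [(k, s) for k in k_values for s in s_values]
print(len(grid))
```

The grid contains 30 parameter pairs, which matches the 30 graphs per dataset mentioned in the quantitative experiment.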

Quantitative results: In our first experiment we evaluate our method in terms of solving the balanced
$k$-cut problem for various balancing functions, data sets and graph parameters. The following table
reports the fraction of times a method achieves the best as well as the strictly best balanced $k$-cut
over all constructed graphs and datasets (in total 30 graphs per dataset). For reference, we also report
the obtained cuts for other clustering methods although they do not directly minimize this criterion
in italic, methods that directly optimize the criterion are shown in normal font. Our algorithm can
handle all balancing functions and significantly outperforms all other methods across all criteria.
For ratio and normalized cut cases we achieve better results than [7, 11, 6], which directly optimize
this criterion. This shows that the greedy recursive bi-partitioning badly affects the performance of
[11], which, otherwise, was shown to obtain the best cuts on several benchmark datasets [24]. This
further shows the need for methods that directly minimize the multi-cut. It is striking that the
competing method of [13], which directly minimizes the asymmetric ratio cut, is beaten significantly by
Graclus as well as our method. As this clear trend is less visible in the qualitative experiments, we
suspect that extreme graph parameters lead to fast convergence to a degenerate solution.

¹ Since [6], a multi-level algorithm directly minimizing Rcut/Ncut, is shown to be superior to METIS, we do not compare with METIS.


                                 Ours     MTV   BSpec    Spec Graclus |    PNMF     NSC    ONMF     LSD    NMFR
RCC-asym  Best (%)              80.54   25.50   23.49    7.38   38.26 |    2.01    5.37    2.01    4.03    1.34
          Strictly Best (%)     44.97   10.74    1.34    0.00    4.70 |    0.00    0.00    0.00    0.00    0.00
RCC-sym   Best (%)              94.63    8.72   19.46    6.71   37.58 |    0.67    4.03    0.00    0.67    0.67
          Strictly Best (%)     61.74    0.00    0.67    0.00    4.70 |    0.00    0.00    0.00    0.00    0.00
NCC-asym  Best (%)              93.29   13.42   20.13   10.07   38.26 |    0.67    5.37    2.01    4.70    2.01
          Strictly Best (%)     56.38    2.01    0.00    0.00    2.01 |    0.00    0.00    0.67    0.00    1.34
NCC-sym   Best (%)              98.66   10.07   20.81    9.40   40.27 |    1.34    4.03    0.67    3.36    1.34
          Strictly Best (%)     59.06    0.00    0.00    0.00    1.34 |    0.00    0.00    0.00    0.00    0.00
Rcut      Best (%)              85.91    7.38   20.13   10.07   32.89 |    0.67    4.03    0.00    1.34    1.34
          Strictly Best (%)     58.39    0.00    2.68    2.01    8.72 |    0.00    0.00    0.00    0.00    0.67
Ncut      Best (%)              95.97   10.07   20.13    9.40   37.58 |    1.34    4.70    0.67    3.36    0.67
          Strictly Best (%)     61.07    0.00    0.00    0.00    4.03 |    0.00    0.00    0.00    0.00    0.00
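The two statistics in the table above can be computed as in the following sketch. The paper does not spell out how ties are detected, so the tolerance `tol` is an assumption, as is the convention that a method is "strictly best" on a graph only when it is the single method attaining the lowest cut. The function name and the toy data are hypothetical.

```python
def best_fractions(cuts_per_graph, tol=1e-9):
    """Percentage of graphs on which each method is best / strictly best.

    cuts_per_graph: one dict per graph, mapping method name -> balanced cut value.
    Methods that failed on a graph are simply absent from that graph's dict.
    """
    methods = sorted({m for graph in cuts_per_graph for m in graph})
    best = {m: 0 for m in methods}
    strictly = {m: 0 for m in methods}
    for graph in cuts_per_graph:
        lowest = min(graph.values())
        winners = [m for m, value in graph.items() if value <= lowest + tol]
        for m in winners:
            best[m] += 1
        if len(winners) == 1:
            strictly[winners[0]] += 1
    total = len(cuts_per_graph)
    return {m: (100.0 * best[m] / total, 100.0 * strictly[m] / total) for m in methods}


toy = [
    {"A": 1.0, "B": 1.0, "C": 2.0},  # A and B tie for best
    {"A": 0.5, "B": 0.7, "C": 0.9},  # A strictly best
    {"A": 1.2, "B": 1.1, "C": 1.3},  # B strictly best
    {"A": 0.3, "B": 0.4, "C": 0.3},  # A and C tie for best
]
print(best_fractions(toy))
```

Because ties count as "best" for every tied method, the "Best (%)" entries of a row can sum to more than 100, as they do in the table.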


Qualitative results: In the following table, we report the clustering errors and the balanced $k$-cuts
obtained by all methods using the graphs built with $k = 15$, $s = 1$ for all datasets. As the main
goal is to compare to [13], we choose their balancing function (RCC-asym). Again, our method always
achieved the best cuts across all datasets. In three cases, the best cut also corresponds to the best
clustering performance. In the case of vertebral, 20news, and webkb4 the best cuts actually result in
high errors. However, we see in our next experiment that integrating ground-truth label information
helps in these cases to improve the clustering performance significantly.




                    Iris   wine  vertebral  ecoli  4moons  webkb4  optdigits   USPS  pendigits  20news  MNIST
BSpec    Err(%)    23.33  37.64      50.00  19.35   36.33   60.46      11.30  20.09      17.59   84.21  11.82
         BCut      1.495  6.417      1.890  2.550   0.634   1.056      0.386  0.822      0.081   0.966  0.471
Spec     Err(%)    22.00  20.22      48.71  14.88   31.45   60.32       7.81  21.05      16.75   79.10  22.83
         BCut      1.783  5.820      1.950  2.759   0.917   1.520      0.442  0.873      0.141   1.170  0.707
PNMF     Err(%)    22.67  27.53      50.00  16.37   35.23   60.94      10.37  24.07      17.93   66.00  12.80
         BCut      1.508  4.916      2.250  2.652   0.737   3.520      0.548  1.180      0.415   2.924  0.934
NSC      Err(%)    23.33  17.98      50.00  14.88   32.05   59.49       8.24  20.53      19.81   78.86  21.27
         BCut      1.518  5.140      2.046  2.754   0.933   3.566      0.482  0.850      0.101   2.233  0.688
ONMF     Err(%)    23.33  28.09      50.65  16.07   35.35   60.94      10.37  24.14      22.82   69.02  27.27
         BCut      1.518  4.881      2.371  2.633   0.725   3.621      0.548  1.183      0.548   3.058  1.575
LSD      Err(%)    23.33  17.98      39.03  18.45   35.68   47.93       8.42  22.68      13.90   67.81  24.49
         BCut      1.518  5.399      2.557  2.523   0.782   2.082      0.483  0.918      0.188   2.056  0.959
NMFR     Err(%)    22.00  11.24      38.06  22.92   36.33   40.73       2.08  22.17      13.13   39.97   fail
         BCut      1.627  4.318      2.713  2.556   0.840   1.467      0.369  0.992      0.240   1.241      -
Graclus  Err(%)    23.33   8.43      49.68  16.37    0.45   39.97       1.67  19.75      10.93   60.69   2.43
         BCut      1.534  4.293      1.890  2.414   0.589   1.581      0.350  0.815      0.092   1.431  0.440
MTV      Err(%)    22.67  18.54      34.52  22.02    7.72   48.40       4.11  15.13      20.55   72.18   3.77
         BCut      1.508  5.556      2.433  2.500   0.774   2.346      0.374  0.940      0.193   3.291  0.458
Ours     Err(%)    23.33   6.74      50.00  16.96    0.45   60.46       1.71  19.72      19.95   79.51   2.37
         BCut      1.495  4.168      1.890  2.399   0.589   1.056      0.350  0.802      0.079   0.895  0.439
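The error rows of the table above depend on how a clustering is matched to the ground-truth classes, which this section does not define. As a hedged illustration, the sketch below assumes a majority-vote convention: every cluster is assigned the most frequent true class among its members, and the error is the percentage of points that disagree with their cluster's class. The function name is hypothetical.

```python
from collections import Counter


def clustering_error(pred, truth):
    """Majority-vote clustering error in percent (assumed convention, see lead-in)."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("pred and truth must be non-empty and of equal length")
    votes = {}
    for cluster, label in zip(pred, truth):
        votes.setdefault(cluster, Counter())[label] += 1
    # Points that carry the majority class of their cluster count as correct.
    correct = sum(counter.most_common(1)[0][1] for counter in votes.values())
    return 100.0 * (1.0 - correct / len(truth))


pred = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "b"]
print(clustering_error(pred, truth))
```

Under this convention a clustering that merges two of three equally sized classes has an error of one third, which is consistent with values such as 33.33 for Iris, although other matching conventions would give the same number in that case.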


Transductive Setting: As in [13], we randomly sample either one label or a fixed percentage of
labels per class from the ground truth. We report clustering errors and the cuts (RCC-asym) for both
methods for different choices of labels. For label experiments their initialization strategy seems to
work better as the cuts improve compared to the unlabeled case. However, observe that in some
cases their method seems to fail completely (Iris and 4moons for one label per class).


Labels                  Iris   wine  vertebral  ecoli  4moons  webkb4  optdigits   USPS  pendigits  20news  MNIST
1       MTV   Err(%)   33.33   9.55      42.26  13.99   35.75   51.98       1.69  12.91      14.49   50.96   2.45
              BCut     3.855  4.288      2.244  2.430   0.723   1.596      0.352  0.846      0.127   1.286  0.439
        Ours  Err(%)   22.67   8.99      50.32  15.48    0.57   45.11       1.69  12.98      10.98   68.53   2.36
              BCut     1.571  4.234      2.265  2.432   0.610   1.471      0.352  0.812      0.113   1.057  0.439

        MTV   Err(%)   33.33  10.67      39.03  14.29    0.45   48.38       1.67   5.21       7.75   40.18   2.41
              BCut     3.855  4.277      2.300  2.429   0.589   1.584      0.354  0.789      0.129   1.208  0.443
        Ours  Err(%)   22.67   6.18      41.29  13.99    0.45   41.63       1.67   5.13       7.75   37.42   2.33
              BCut     1.571  4.220      2.288  2.419   0.589   1.462      0.354  0.789      0.128   1.157  0.442

        MTV   Err(%)   17.33   7.87      40.65  14.58    0.45   40.09       1.51   4.85       1.79   31.89   2.18
              BCut     1.685  4.330      2.701  2.462   0.589   1.763      0.369  0.812      0.188   1.254  0.455
        Ours  Err(%)   17.33   6.74      37.10  13.99    0.45   38.04       1.53   4.85       1.76   30.07   2.18
              BCut     1.685  4.224      2.724  2.461   0.589   1.719      0.369  0.811      0.188   1.210  0.455

10%     MTV   Err(%)   18.67   7.30      39.03  13.39    0.38   40.63       1.41   4.19       1.24   27.80   2.03
              BCut     1.954  4.332      3.187  2.776   0.592   2.057      0.377  0.833      0.197   1.346  0.465
        Ours  Err(%)   14.67   6.74      33.87  13.10    0.38   41.97       1.41   4.25       1.24   26.55   2.02
              BCut     1.960  4.194      3.134  2.778   0.592   1.972      0.377  0.833      0.197   1.314  0.465
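The label sampling used in the transductive setting can be sketched as follows. The text only states that either one label or a fixed percentage of labels per class is drawn at random from the ground truth, so the rounding rule (round up, with at least one label per class), the seed handling and the function name are assumptions made for illustration.

```python
import math
import random


def sample_labeled_vertices(truth, fraction=None, seed=0):
    """Draw labeled vertices per class from the ground truth.

    fraction=None draws one label per class; otherwise ceil(fraction * class size)
    labels are drawn per class, and never fewer than one (assumed rounding rule).
    Returns a dict mapping vertex index -> class label.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(truth):
        by_class.setdefault(label, []).append(index)
    labeled = {}
    for label, indices in by_class.items():
        count = 1 if fraction is None else max(1, math.ceil(fraction * len(indices)))
        for index in rng.sample(indices, count):
            labeled[index] = label
    return labeled


truth = [0] * 20 + [1] * 10
print(sample_labeled_vertices(truth))                # one label per class
print(sample_labeled_vertices(truth, fraction=0.5))  # half of every class
```

The returned pairs play the role of the label set $L$: each sampled vertex is fixed to its ground-truth class before the optimization starts.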


6 Conclusion 


We presented a framework for directly minimizing the balanced $k$-cut problem based on a new
continuous relaxation. Apart from ratio/normalized cut, our method can also handle new
application-specific balancing functions. Moreover, in contrast to a recursive splitting approach [25],
our method enables the direct integration of prior information available in the form of
must/cannot-link constraints, which is an interesting topic for future research. Finally, the monotonic
descent algorithm proposed for the difficult sum-of-ratios problem is another key contribution that is
of independent interest.